Quality and Complexity Measures for Data Linkage and Deduplication

Authors

Peter Christen

Karl Goiser

Abstract

Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
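The claim that pair-space measures can give deceptive accuracy results can be illustrated with a small calculation. The sketch below is not taken from the chapter: the data set sizes, the match counts, and the helper names (`PairCounts`, `pair_counts`, `accuracy`, `precision`, `recall`) are all hypothetical, chosen only to show how the large number of true non-matches dominates accuracy when every record in one data set is compared with every record in another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCounts:
    """Confusion-matrix counts over the space of record pair comparisons."""
    tp: int  # true matches classified as matches
    fp: int  # non-matches classified as matches
    fn: int  # true matches classified as non-matches
    tn: int  # non-matches classified as non-matches


def pair_counts(n_a: int, n_b: int, true_matches: int,
                predicted_matches: int, correct_matches: int) -> PairCounts:
    """Derive pair-space counts for linking data sets of n_a and n_b records.

    Assumes every record in A is compared with every record in B,
    giving n_a * n_b record pairs in total.
    """
    total_pairs = n_a * n_b
    tp = correct_matches
    fp = predicted_matches - correct_matches
    fn = true_matches - correct_matches
    tn = total_pairs - tp - fp - fn
    return PairCounts(tp, fp, fn, tn)


def accuracy(c: PairCounts) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c: PairCounts) -> float:
    # Convention assumed here: precision is 0.0 when nothing is predicted.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: PairCounts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


if __name__ == "__main__":
    # Hypothetical scenario: two data sets of 1,000 records each with
    # 1,000 true matches. The classifier declares 900 pairs to be matches,
    # of which 600 are correct.
    mediocre = pair_counts(1000, 1000, 1000, 900, 600)
    # A trivial classifier that declares no matches at all.
    trivial = pair_counts(1000, 1000, 1000, 0, 0)

    for name, c in [("mediocre", mediocre), ("trivial", trivial)]:
        print(f"{name}: accuracy={accuracy(c):.4f} "
              f"precision={precision(c):.4f} recall={recall(c):.4f}")
```

Under these assumed numbers, both classifiers reach an accuracy of at least 0.999 even though one finds only 60% of the true matches and the other finds none. Precision and recall ignore the true non-matches, so they expose the difference that accuracy hides.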

Similar resources

Assessing Deduplication and Data Linkage Quality: What to Measure?

Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research ...

An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework

Record linkage is an important process in data integration, which is used in merging, matching and duplicate removal from several databases that refer to the same entities. Deduplication is the process of removing duplicate records in a single database. In recent years, data cleaning and standardization becomes an important process in data mining task. Due to complexity of today’s database, fin...

Clustering Quality Measures

Aiming towards the development of a general clustering theory, addressing issues that are common to the different clustering paradigms, we wish to initiate a systematic study of measures for the quality of a given data clustering. A clustering quality measure is a function that, given a data set and its partition into clusters, returns a non-negative real number representing the quality of that...

Quantifying the correctness, computational complexity, and security of privacy-preserving string comparators for record linkage

Record linkage is the task of identifying records from disparate data sources that refer to the same entity. It is an integral component of data processing in distributed settings, where the integration of information from multiple sources can prevent duplication and enrich overall data quality, thus enabling more detailed and correct analysis. Privacy-preserving record linkage (PPRL) is a vari...

Tuning & Recommended Related Evolution Approaches for Distributed Databases

Today’s databases are complex databases with duplicates. Due to complexity database we introduce the tuning and recommendation techniques. Tuning and recommendation process is important task in data integration task. Different existing system techniques like record matching, record linkage detects the same entities in single database. Deduplication removes the duplicates in single database. T...

Publication date: 2007
